Load libraries

library(limma)
library(Glimma)
library(tidyverse)
library(DESeq2)
## Warning: package 'S4Vectors' was built under R version 4.3.2
library(org.Hs.eg.db)
library(pheatmap)
library(EnhancedVolcano)
library(ComplexHeatmap)
library(gplots)
library(msigdbr)
library(clusterProfiler)
## Warning: package 'clusterProfiler' was built under R version 4.3.2
library(enrichplot)
## Warning: package 'enrichplot' was built under R version 4.3.2
library(DOSE)
## Warning: package 'DOSE' was built under R version 4.3.2

Import data

counts <- read_rds(file = "counts_matrix_raw.rds")
coldata <- read_rds(file = "coldata_dataframe.rds")
identical(colnames(counts), coldata$sample)
## [1] TRUE

Discussion of the project

This RNASeq dataset is to complement other assays conducting in determining the role of IL-21 in stimulating gd T cells to induce anti-inflammatory IL-10.

“When cultured in the presence of IL-21, Vγ9/Vδ2 T-cells acquired the ability to induce expression of the immunoregulatory cytokine IL-10 in both naïve and memory CD4+ T-cells…” (M Eberl, Unpublished)

This dataset contains PBMC extractions from three healthy donors. The gd T cells have been isolated and cultured under various stimuli: Unstimulated, IL-21, IL-2, IL-15.

While valuable information can be gleaned from this dataset, it is essential to acknowledge its limitations, given its small cohort size (n = 3). Preventing further, extensive analysis is the observation that one of the healthy donors (D3) diverges from the pattern observed in the other two donors. The unique gene expression pattern in D3 poses a challenge in discerning whether this deviation is attributable to technical or biological variation, given the cohort’s limited size. The issues identified in D3 are detailed in the supplementary section of this markdown. Despite the absence of a clear rationale for excluding D3, it has been retained in the analysis. While it may have potentially suppressed the differential expression of certain genes, we anticipate that this will have limited effect as we focus on genes that exhibit the most differential expression between groups. Despite potential confounding from D3, we expect to identify the most differentially expressed genes. Our particular interest lies in genes related to cell-cell signaling, protein expression, and immunity; therefore, our analysis and presentation of data will emphasise this aspect of the data.

Given the challenges associated with the dataset, we place limited emphasis on analyses beyond differential expression. Brief pathway analyses have been conducted and are presented in the supplementary section of this document.

This dataset, along with other assays included in this project, indicates a mechanistic change in gd T cells during IL-21 stimulation but not in IL-2 and IL-15 stimulation. Therefore, the objective is to conduct a Wald test analysis of the log fold change of genes, comparing IL-21 vs IL-2 and IL-21 vs IL-15, with a specific focus on up-regulated genes in IL-21. Our focus on IL-21 up-regulated genes stems from our expectation of an active role of gd T cells in mechanistic interactions with CD4 T and CD4 T naive cells.

We aim to present a table of differentially expressed genes and visualize these results through a volcano plot. In particular, we are interested in genes related to immunity and cell signaling, and these immune-related genes will be appropriately labeled. This approach facilitates a discussion of differentially expressed genes with a focus on immune-related aspects, specifically CXCL13 and TNFRS8 (CD30).

The data pre-processing was conducted using NF-Core RNASeq nextflow package, and a count table produced using RSubread package. Full configuration and version information will be declared before publication.

Rename gene list

gene.list <- data.frame(entrez.id = rownames(counts))
gene.list <- gene.list %>%
  dplyr::mutate(
    SYMBOL = mapIds(org.Hs.eg.db, keys = gene.list$entrez.id, column = "SYMBOL", keytype = "ENTREZID"),
    GENETYPE = mapIds(org.Hs.eg.db, keys = gene.list$entrez.id, column = "GENETYPE", keytype = "ENTREZID"),
    ENSEMBL = mapIds(org.Hs.eg.db, keys = gene.list$entrez.id, column = "ENSEMBL", keytype = "ENTREZID")
  )

# Rename to SYMBOL
filtered.gene.list <- gene.list %>%
  dplyr::filter(
    !is.na(ENSEMBL),
    !duplicated(ENSEMBL))

# Rename gene IDs from ensembl to symbol
counts <- counts[rownames(counts) %in% filtered.gene.list$entrez.id, ]
rownames(counts) <- filtered.gene.list$SYMBOL[match(rownames(counts), filtered.gene.list$entrez.id)]

Generate DESeq2 object

dds <- DESeqDataSetFromMatrix(counts,coldata,design = ~healthy_donor + condition)
dds <- estimateSizeFactors(dds)
idx <- rowSums(counts(dds, normalized=TRUE) >= 5 ) >= 3
dds <- dds[idx,]
dds <- DESeq(dds)
vst <- varianceStabilizingTransformation(dds)
vsn::meanSdPlot(assay(vst))

plotPCA(vst)

sizeFactors(dds)
##   MERNA01   MERNA02   MERNA03   MERNA05   MERNA06   MERNA07   MERNA08   MERNA10 
## 0.3877391 0.2741379 1.1983553 1.6851257 1.5174987 0.6571399 1.6005437 1.4827085 
##   MERNA11   MERNA12   MERNA13   MERNA15 
## 1.2401335 1.8634610 0.6797028 1.3903766
normalized_counts <- counts(dds, normalized=TRUE)

DESeq2 results - Generalised linear model (Wald test)

res.il15 <- results(dds,contrast = c("condition","IL21","IL15"))
res.sig.il15 <- subset(res.il15, padj < 0.1)
lfc.il15 <- res.sig.il15[ order(res.sig.il15$log2FoldChange, decreasing=TRUE), ]
summary(lfc.il15)
## 
## out of 628 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 139, 22%
## LFC < 0 (down)     : 489, 78%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
lfc.il15
## log2 fold change (MLE): condition IL21 vs IL15 
## Wald test p-value: condition IL21 vs IL15 
## DataFrame with 628 rows and 6 columns
##           baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##          <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## CXCL13    79.85535        8.51801   2.36880   3.59592 3.23253e-04 0.018058412
## COL6A3   394.68891        6.92096   1.40108   4.93973 7.82315e-07 0.000352279
## IL17F      6.75758        4.71570   1.50876   3.12554 1.77477e-03 0.057084538
## IGHV3-15  19.76527        4.60317   1.36282   3.37768 7.31006e-04 0.031395218
## TNFRSF8  419.26348        4.47405   1.45738   3.06992 2.14118e-03 0.063763528
## ...            ...            ...       ...       ...         ...         ...
## RGS8       18.6179       -6.36659   1.73978  -3.65941 2.52792e-04 0.015025960
## HSPA4L     23.3843       -6.36932   1.50293  -4.23794 2.25575e-05 0.003007899
## OLAH       37.6269       -7.54612   1.58918  -4.74843 2.05006e-06 0.000672316
## HMCN1      96.6456       -7.68855   1.61109  -4.77227 1.82158e-06 0.000629505
## ENPEP      66.1783       -7.90946   2.16317  -3.65642 2.55758e-04 0.015141688
res.il2 <- results(dds,contrast = c("condition","IL21","IL2"))
res.sig.il2 <- subset(res.il2, padj < 0.1)
lfc.il2 <- res.sig.il2[ order(res.sig.il2$log2FoldChange, decreasing=TRUE), ]
summary(lfc.il2)
## 
## out of 2177 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 850, 39%
## LFC < 0 (down)     : 1327, 61%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
lfc.il2
## log2 fold change (MLE): condition IL21 vs IL2 
## Wald test p-value: condition IL21 vs IL2 
## DataFrame with 2177 rows and 6 columns
##           baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##          <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## CXCL13    79.85535        9.67648   2.40719   4.01983 5.82404e-05 1.85554e-03
## COL6A3   394.68891        6.99235   1.38942   5.03257 4.83942e-07 3.63206e-05
## IGHV3-74   6.11918        5.91089   1.83752   3.21678 1.29637e-03 2.02523e-02
## FCRL5     30.46544        5.78625   1.19297   4.85030 1.23277e-06 8.08520e-05
## IGKV1-9   32.67039        5.27330   2.06343   2.55560 1.06005e-02 8.04712e-02
## ...            ...            ...       ...       ...         ...         ...
## HSPA4L     23.3843       -7.89716   1.48374  -5.32247 1.02366e-07 9.87050e-06
## OLAH       37.6269       -8.38289   1.58359  -5.29358 1.19943e-07 1.14142e-05
## HMCN1      96.6456       -8.79858   1.60829  -5.47075 4.48135e-08 5.01912e-06
## DNAJC22    39.4756       -8.82934   1.51177  -5.84040 5.20769e-09 9.02666e-07
## ENPEP      66.1783       -9.40826   2.15815  -4.35941 1.30413e-05 5.51979e-04
lfc.il15.df <- lfc.il15 %>% as.data.frame() %>% rownames_to_column(.,var = 'GENE_ID')
lfc.il2.df <- lfc.il2 %>% as.data.frame() %>% rownames_to_column(.,var = 'GENE_ID')
write_csv(lfc.il15.df, "lfc_il15_results.csv")
write_csv(lfc.il2.df, "lfc_il2_results.csv")

Venn diagram common genes that are immunologically relevent (msigDB C7 genesets)

To extract a list of interesting genes which is DE in both conditions

We now have many differentially expressed genes. To further focus the investigation and subset for informative genes, we have used the C7 immune signature database to provide genes that are known to be related in some capacity to immune regulation. The following chapter also shows the differentially expressed genes that are shared between both Wald tests conducted above. Therefore, we aim to have a list of genes that are found in the C7 immune signature database, in the IL-21 vs IL-2, and in the IL-21 vs IL-15 Wald tests. This will narrow down the investigation for commonly expressed genes in IL-21 that are also found to be immunologically relevant.

## Data preparation
il15.degs <- row.names(lfc.il15)
il2.degs <- row.names(lfc.il2)
list.degs <- unique(paste(c(il15.degs,il2.degs)))

# Create a Venn-diagram given just the list of gene-names for both sets
venn(list("IL15vsIL21" = il15.degs, "IL2vsIL21" = il2.degs), )

common_genes <- Reduce(intersect, list(il2.degs, il15.degs))

db.gs.c7 <- msigdbr(species = "human", category = "C7", subcategory = "IMMUNESIGDB")
c7.common.genes <- common_genes[common_genes %in% unique(db.gs.c7$gene_symbol)]

venn(list("IL15vsIL21" = il15.degs, "IL2vsIL21" = il2.degs, "c7genes" = c7.common.genes), )

DT::datatable(as.data.frame(c7.common.genes))

Volcano plot

Finally, taking the list of informative genes that we have obtained from the previous chapter, we show a volcano plot of differentially expressed genes. To note, that the padj has been left more inclusive, at 0.07, rather than the typical 0.05. This is due some observations not showing as classically statistically significant, but still being of note, as there is a substantial log2FoldChange.

# Set fold change and p-value cutoffs
fold_cutoff <- 2.5
pvalue_cutoff <- 0.07

# Filter data based on cutoffs
res.il2.labels <- subset(res.il2, abs(log2FoldChange) > fold_cutoff & -log10(padj) > -log10(pvalue_cutoff))
res.il15.labels <- subset(res.il15, abs(log2FoldChange) > fold_cutoff & -log10(padj) > -log10(pvalue_cutoff))
filtered_data_intersect <- intersect(intersect(rownames(res.il2.labels), rownames(res.il15.labels)), c7.common.genes)
message(paste0("Length of C7 immune genes common in both IL15 and IL2 comparisons, \n that satisfy the padj and FC thresholds: ",length(filtered_data_intersect)))
## Length of C7 immune genes common in both IL15 and IL2 comparisons, 
##  that satisfy the padj and FC thresholds: 97
EnhancedVolcano(res.il2,
    lab = rownames(res.il2),
    selectLab = filtered_data_intersect,
    drawConnectors = TRUE,
    x = 'log2FoldChange',
    y = 'padj',
    title = 'IL21 versus IL2',
    pCutoff = pvalue_cutoff,
    FCcutoff = fold_cutoff,
    pointSize = 3.0,
    boxedLabels = TRUE,
    labSize = 6.0)
## Warning: ggrepel: 43 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

EnhancedVolcano(res.il15,
    lab = rownames(res.il15),
    selectLab = filtered_data_intersect,
    drawConnectors = TRUE,
    x = 'log2FoldChange',
    y = 'padj',
    title = 'IL21 versus IL15',
    pCutoff = pvalue_cutoff,
    FCcutoff = fold_cutoff,
    pointSize = 3.0,
    boxedLabels = TRUE,
    labSize = 6.0)
## Warning: ggrepel: 57 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Supplementary analysis

Many other types of exploratory analysis was conducted on this dataset, but as mentioned earlier in the document, a small cohort size limits the extent further statistical methods can be applied.

For reference, a variance heatmap and a pathway analysis has been conducted on this dataset and can be seen below. The variance heatmap shows the extent to which D3 does not follow the same pattern of gene expression as the other healthy donor samples. The pathway analysis shows some indication of immune-related signalling changes, but the current dataset appears to be underpowered to draw significant conclusions from the data.

Total variance heatmap

vst_data <- assay(vst)
topVarGenes <- order(rowVars(vst_data), decreasing = TRUE)
mat  <- vst_data[topVarGenes[1:20], ]
# mat  <- vst_data[topVarGenes, ]
mat  <- mat - rowMeans(mat)

anno <- as.data.frame(colData(dds))
annocols1 <- paste(c(unique(anno$condition)))
annocols2 <- paste(c(unique(anno$healthy_donor)))
vip.cols1 <- RColorBrewer::brewer.pal(name = "Dark2", n = length(annocols1))
vip.cols2 <- RColorBrewer::brewer.pal(name = "Set2", n = length(annocols2))
names(vip.cols1) = sort(unique(anno$condition))
names(vip.cols2) = sort(unique(anno$healthy_donor))

HA <- columnAnnotation(df = anno %>% dplyr::select(c("condition","healthy_donor")), col = list(condition=vip.cols1,healthy_donor=vip.cols2))

Heatmap(mat, 
        name = "Relative variance",
        column_title = "Filtered counts matrix \n Variance stabilised expression heatmap \n Gene expression relative to the mean", 
        top_annotation = HA, 
        cluster_columns = TRUE, 
        #column_split = anno$healthy_donor,
        cluster_rows = TRUE
        )

Brief pathway analysis using GO database

# reading in data from deseq2
df <- res.il2
df <- data.frame(df)

# we want the log2 fold change 
original_gene_list <- df %>% rownames_to_column() %>% dplyr::select(c("rowname","log2FoldChange"))

# name the vector
names(original_gene_list) <- df$X

# omit any NA values 
gene_list<-na.omit(original_gene_list)

geneList = df %>% dplyr::select("log2FoldChange")
gene_list_ordered <- gene_list[order(gene_list[,2], decreasing = TRUE), ]
rownames(gene_list_ordered) <- gene_list_ordered[,1]
gene_list_ordered <- gene_list_ordered[,2]
names(gene_list_ordered) <- rownames(df)
gse <- gseGO(geneList=gene_list_ordered,
             ont ="ALL", 
             keyType = "SYMBOL", 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = org.Hs.eg.db,
             pAdjustMethod = "none")
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
dotplot(gse, showCategory=20
        , split=".sign") +
  facet_grid(.~.sign)

retrieved <- AnnotationDbi::select(org.Hs.eg.db, keytype="GOALL", keys=c("GO:0042101"), columns="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
head(retrieved)
##        GOALL EVIDENCEALL ONTOLOGYALL SYMBOL
## 1 GO:0042101         IBA          CC   CD3D
## 2 GO:0042101         IDA          CC   CD3D
## 3 GO:0042101         IPI          CC   CD3D
## 4 GO:0042101         ISS          CC   CD3D
## 5 GO:0042101         IBA          CC   CD3E
## 6 GO:0042101         IDA          CC   CD3E
retrieved <- unique(retrieved$SYMBOL)

R session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DOSE_3.28.1                 enrichplot_1.22.0          
##  [3] clusterProfiler_4.10.0      msigdbr_7.5.1              
##  [5] gplots_3.1.3                ComplexHeatmap_2.18.0      
##  [7] EnhancedVolcano_1.20.0      ggrepel_0.9.4              
##  [9] pheatmap_1.0.12             org.Hs.eg.db_3.18.0        
## [11] AnnotationDbi_1.64.1        DESeq2_1.42.0              
## [13] SummarizedExperiment_1.32.0 Biobase_2.62.0             
## [15] MatrixGenerics_1.14.0       matrixStats_1.1.0          
## [17] GenomicRanges_1.54.1        GenomeInfoDb_1.38.0        
## [19] IRanges_2.36.0              S4Vectors_0.40.2           
## [21] BiocGenerics_0.48.1         lubridate_1.9.3            
## [23] forcats_1.0.0               stringr_1.5.0              
## [25] dplyr_1.1.4                 purrr_1.0.2                
## [27] readr_2.1.4                 tidyr_1.3.0                
## [29] tibble_3.2.1                ggplot2_3.4.4              
## [31] tidyverse_2.0.0             Glimma_2.12.0              
## [33] limma_3.58.1               
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1           bitops_1.0-7            ggplotify_0.1.2        
##   [4] filelock_1.0.2          polyclip_1.10-6         preprocessCore_1.64.0  
##   [7] lifecycle_1.0.3         edgeR_4.0.1             doParallel_1.0.17      
##  [10] vroom_1.6.4             lattice_0.22-5          MASS_7.3-60            
##  [13] crosstalk_1.2.1         magrittr_2.0.3          sass_0.4.7             
##  [16] rmarkdown_2.25          jquerylib_0.1.4         yaml_2.3.7             
##  [19] cowplot_1.1.1           DBI_1.1.3               RColorBrewer_1.1-3     
##  [22] abind_1.4-5             zlibbioc_1.48.0         ggraph_2.1.0           
##  [25] RCurl_1.98-1.12         yulab.utils_0.1.0       tweenr_2.0.2           
##  [28] circlize_0.4.15         GenomeInfoDbData_1.2.11 tidytree_0.4.5         
##  [31] codetools_0.2-19        DelayedArray_0.28.0     DT_0.30                
##  [34] ggforce_0.4.1           tidyselect_1.2.0        shape_1.4.6            
##  [37] aplot_0.2.2             farver_2.1.1            viridis_0.6.4          
##  [40] BiocFileCache_2.10.1    jsonlite_1.8.7          GetoptLong_1.0.5       
##  [43] ellipsis_0.3.2          tidygraph_1.2.3         iterators_1.0.14       
##  [46] foreach_1.5.2           tools_4.3.1             treeio_1.26.0          
##  [49] Rcpp_1.0.11             glue_1.6.2              gridExtra_2.3          
##  [52] SparseArray_1.2.2       xfun_0.41               qvalue_2.34.0          
##  [55] withr_2.5.2             BiocManager_1.30.22     fastmap_1.1.1          
##  [58] fansi_1.0.5             caTools_1.18.2          digest_0.6.33          
##  [61] timechange_0.2.0        R6_2.5.1                gridGraphics_0.5-1     
##  [64] colorspace_2.1-0        Cairo_1.6-2             GO.db_3.18.0           
##  [67] gtools_3.9.5            RSQLite_2.3.2           hexbin_1.28.3          
##  [70] utf8_1.2.4              generics_0.1.3          data.table_1.14.8      
##  [73] graphlayouts_1.0.2      httr_1.4.7              htmlwidgets_1.6.3      
##  [76] S4Arrays_1.2.0          scatterpie_0.2.1        pkgconfig_2.0.3        
##  [79] gtable_0.3.4            blob_1.2.4              XVector_0.42.0         
##  [82] shadowtext_0.1.2        htmltools_0.5.7         fgsea_1.28.0           
##  [85] clue_0.3-65             scales_1.2.1            png_0.1-8              
##  [88] ggfun_0.1.3             knitr_1.45              rstudioapi_0.15.0      
##  [91] tzdb_0.4.0              reshape2_1.4.4          rjson_0.2.21           
##  [94] nlme_3.1-163            curl_5.1.0              cachem_1.0.8           
##  [97] GlobalOptions_0.1.2     KernSmooth_2.23-22      parallel_4.3.1         
## [100] HDO.db_0.99.1           vsn_3.70.0              pillar_1.9.0           
## [103] vctrs_0.6.4             dbplyr_2.4.0            cluster_2.1.4          
## [106] evaluate_0.23           cli_3.6.1               locfit_1.5-9.8         
## [109] compiler_4.3.1          rlang_1.1.1             crayon_1.5.2           
## [112] labeling_0.4.3          affy_1.80.0             plyr_1.8.9             
## [115] fs_1.6.3                stringi_1.7.12          viridisLite_0.4.2      
## [118] BiocParallel_1.36.0     babelgene_22.9          munsell_0.5.0          
## [121] Biostrings_2.70.1       lazyeval_0.2.2          GOSemSim_2.28.0        
## [124] Matrix_1.6-3            hms_1.1.3               patchwork_1.1.3        
## [127] bit64_4.0.5             KEGGREST_1.42.0         statmod_1.5.0          
## [130] highr_0.10              igraph_1.5.1            memoise_2.0.1          
## [133] affyio_1.72.0           bslib_0.6.1             ggtree_3.10.0          
## [136] fastmatch_1.1-4         bit_4.0.5               ape_5.7-1              
## [139] gson_0.1.0